Investigating the association of environmental exposures and all-cause mortality in the UK Biobank using sparse principal component analysis

Multicollinearity refers to the presence of collinearity between multiple variables and renders the results of statistical inference erroneous (Type II error). This is particularly important in environmental health research, where multicollinearity can hinder inference. To address this, correlated variables are often excluded from the analysis, limiting the discovery of new associations. An alternative approach is principal component analysis (PCA). This method combines and projects a group of correlated variables onto a new orthogonal space. While this resolves the multicollinearity problem, it poses another challenge in relation to the interpretability of results. Standard hypothesis testing methods can be used to evaluate the association of the projected predictors, called principal components, with the outcomes of interest; however, there is no established way to trace the significance of principal components back to individual variables. To address this problem, we investigated the use of sparse principal component analysis (SPCA), which enforces a parsimonious projection. We hypothesise that this parsimony could facilitate the interpretability of findings. To this end, we investigated the association of 20 environmental predictors with all-cause mortality, adjusting for demographic, socioeconomic, physiological, and behavioural factors. The study was conducted in a cohort of 379,690 individuals in the UK. During an average follow-up of 8.05 years (3,055,166 total person-years), 14,996 deaths were observed. We used Cox regression models to estimate the hazard ratios (HR) and 95% confidence intervals (CI). The Cox models were fitted to the standardised environmental predictors (a) without any transformation, (b) transformed with PCA, and (c) transformed with SPCA. The comparison of findings underlined the potential of SPCA for conducting inference in scenarios where multicollinearity can increase the risk of Type II error.
Our analysis revealed a significant association between average noise pollution and increased risk of all-cause mortality. Specifically, those in the upper deciles of noise exposure have a 5 to 10% higher risk of all-cause mortality compared to the lowest decile.

www.nature.com/scientificreports/ road length within 100 m buffer. Average daytime, evening time and night-time sound level of road traffic noise pollution were derived for year 2010 using the CNOSSOS model 23 .
Other environmental indicators included the proportion of green space, natural environment, domestic garden, and water within 300 m and 1000 m of residential addresses, derived from the 2005 Generalised Land Use Database for England and the Centre for Ecology and Hydrology 2007 Land Cover Map data for Great Britain 24 . The buffer sizes were chosen based on relevant health evidence and public policy on both density and accessibility. Coastal proximity was estimated using a Euclidean distance raster 25 .
All the exposure indicators were modelled or available for a single year only, which may differ by up to 4 years from the year of recruitment. This may particularly affect the air pollution and road traffic noise estimates, whose distributions tend to vary in space and time. As with other studies 26,27 using these air pollution and noise data in UK Biobank, we assumed that, whilst the absolute traffic volumes will have changed between earlier baseline periods and 2010, the relative differences in these exposures are likely to have been spatially stable over this short period in the UK. This assumption is supported by findings for NO2 air pollution in Great Britain, for which road traffic is a major source, where LUR-modelled NO2 estimates for 2009 could be reliably back-extrapolated to the early 1990s 28 . Between 2010 and 2018, total annual emissions of PM10 and PM2.5 were stable across the UK, while emissions of NO2 decreased proportionally according to the official statistics 29 . While we cannot exclude the possibility of exposure misclassification, the decision to use single-year annual average exposures at baseline to represent the annual average exposures during the entire follow-up period was deemed justifiable.
Additional covariates. In the regression analysis, we adjusted for a number of sociodemographic, socioeconomic, physiological, behavioural and lifestyle determinants of health. Specifically, we adjusted for age, sex, ethnicity, Townsend Deprivation Index, household income, qualifications, employment status, standing height, body mass index, average systolic blood pressure (SBP), average diastolic blood pressure (DBP), average pulse rate (PR), alcohol consumption and smoking status. Table 1 provides a descriptive summary of the cohort.
Health outcome. We used all-cause mortality as the outcome of interest. The date of death was extracted from the linked national death registries. An event was ascertained if death was recorded between the date of recruitment and the end of follow-up (censoring date: 1st May 2017). Fig. 1 shows the top 20 ICD10 codes that were registered as the primary causes of death.
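As a minimal sketch of the event definition above, the following Python function derives the follow-up time and the event indicator for a single participant. The function name `follow_up`, its arguments and the 365.25-day year are assumptions made for illustration; only the censoring date is taken from the text.

```python
from datetime import date

CENSORING_DATE = date(2017, 5, 1)  # end of follow-up stated in the text


def follow_up(recruited, died=None, censor=CENSORING_DATE):
    """Return (years_at_risk, event) for one participant.

    event is 1 only if a death was recorded on or after recruitment and no
    later than the censoring date; otherwise the participant is censored.
    """
    event = int(died is not None and recruited <= died <= censor)
    end = died if event else censor
    # Assumption: follow-up is expressed in years of 365.25 days.
    years = (end - recruited).days / 365.25
    return years, event


# Hypothetical participants, all recruited on 1 May 2009.
print(follow_up(date(2009, 5, 1), date(2012, 5, 1)))  # death within follow-up
print(follow_up(date(2009, 5, 1)))                    # alive at censoring
print(follow_up(date(2009, 5, 1), date(2018, 1, 1)))  # death after censoring
```

A death recorded after the censoring date is treated as censored at 1 May 2017, so it contributes person-time but no event.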
Statistical analysis. We used SPCA, which was originally proposed by Zou and colleagues 18 . Our hypothesis is that the sparsity of principal components in SPCA can help overcome the limitation of PCA in identifying important stressors. The term 'sparse' in SPCA means that most of the coefficients in the loading matrix are zero, so each derived principal component is related to only a small subset of the variables. Additionally, in contrast to PCA, each variable can contribute to only a small number of principal components in SPCA. These two features are expected to facilitate the interpretability of results. This is schematically demonstrated in Fig. 2, where x_i ∈ R^n is the vector of variables for observation i. The arrows represent the loading matrix V ∈ R^(n×m) and map the variables to the principal components z_i ∈ R^m, where often m ≪ n. Following this projection, a regression analysis may map the principal components to the outcome of interest, y. Standard statistical hypothesis testing methods can determine the significance of associations between the principal components, z, and the outcome, y; however, the dense mapping between the variables, x, and the principal components, z, means that these associations cannot be traced back to the variables. We expect SPCA to resolve this by providing a sparse loading matrix.
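The traceability argument can be illustrated with a small example. The sketch below projects one observation through a loading matrix and lists, for each component, the variables with non-zero loadings. The variable names, the loading values and the helper names `project` and `contributing_variables` are all hypothetical and are not taken from the study.

```python
def project(x, V):
    """Compute z = V^T x for a loading matrix V stored as one row per variable."""
    n_components = len(V[0])
    return [sum(x[i] * V[i][j] for i in range(len(x))) for j in range(n_components)]


def contributing_variables(V, names, tol=1e-12):
    """For each component (column of V), list the variables with non-zero loadings."""
    n_components = len(V[0])
    return [[names[i] for i in range(len(V)) if abs(V[i][j]) > tol]
            for j in range(n_components)]


# Hypothetical variables and loadings, for illustration only.
names = ["noise_day", "noise_evening", "noise_night", "no2", "pm10"]
V_dense = [[0.5, -0.3], [0.4, 0.2], [0.5, -0.1], [0.4, 0.6], [0.4, 0.7]]
V_sparse = [[1 / 3, 0.0], [1 / 3, 0.0], [1 / 3, 0.0], [0.0, 0.5], [0.0, 0.5]]

x = [60.0, 57.0, 51.0, 20.0, 10.0]  # one hypothetical observation
print(project(x, V_sparse))
print(contributing_variables(V_dense, names))   # every variable in every component
print(contributing_variables(V_sparse, names))  # disjoint, readable groups
```

With the dense matrix, a significant component points back to all five variables; with the sparse matrix, it points back to a single small group.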
In order to achieve sparsity, SPCA penalises the absolute value of the loadings at the cost of a loss of information. A hyperparameter, λ, is used to balance the trade-off between information loss and the sparsity of the loading matrix. Several implementations of SPCA have been proposed; here, we used the implementation reported by Erichson et al. 30 , whose formulation minimises the sum of a reconstruction error, v(B), and a penalty term, ψ(B), weighted by λ. The penalty ψ(B) could be the L1 norm (LASSO), the L2 norm (ridge), or a combination of the two (elastic net). The hyperparameter λ controls the trade-off between the reconstruction error and sparsity; a larger value of λ produces a sparser model. Hereafter this parameter is denoted by λ_SPCA to distinguish it from the penalty coefficient in the penalised regression model (λ_Cox). The data matrix is denoted by X, B is the sparse weight matrix and A is an orthonormal matrix.
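The symbols defined above suggest an objective of the following shape. This is a sketch under assumptions rather than a quotation of the formulation used: the Frobenius-norm form of v(B), the factor 1/2 and the placement of λ follow a variable-projection form of SPCA and should be checked against Erichson et al. 30 .

```latex
\min_{B}\; v(B) + \lambda\,\psi(B),
\qquad
v(B) \;=\; \min_{A}\; \tfrac{1}{2}\,\bigl\lVert X - X B A^{\top} \bigr\rVert_{F}^{2}
\quad \text{subject to } A^{\top} A = I .
```

Under this reading, setting λ = 0 removes the penalty, and increasing λ drives more entries of B to zero when ψ includes an L1 term.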

Results
We used Cox regression to evaluate the association of environmental variables with all-cause mortality after adjusting for the aforementioned covariates. We compared the results when, . We varied the coefficient of the L1 penalty term, λ_Cox, between 0 and 2e-3 at 5e-5 intervals, producing different levels of sparsity (supplementary materials: Fig. S1). (c) The environmental variables were transformed with PCA. The number of principal components was selected to explain 90% of the variance in the data, leading to seven principal components. The Cox regression model was fitted to the resulting principal components and other covariates (PCA Cox model hereafter); (d) We repeated step (c) using SPCA. The coefficient of the L1 penalty, λ_SPCA, was selected to increase model parsimony. Increasing the value of λ_SPCA results in principal components that consist of a smaller set of variables. To facilitate interpretability, we selected λ_SPCA such that no two principal components share the same variable; in other words, each variable contributes to at most one principal component. More details about the selection of λ_SPCA are included in supplementary materials (Fig. S2). The number of principal components was similarly selected to explain 90% of the data variance, leading to seven principal components (SPCA Cox model hereafter).
The number of follow-up years was the underlying time variable for all Cox models. Prior to the analysis, all numeric variables were examined for normality and outliers. Subsequently, they were standardised, and values above five or below minus five were set to five and minus five, respectively.
Figure 3a depicts the coefficients of the environmental variables in the Cox model. Multicollinearity in the Cox model results in high standard errors in the estimation of the coefficients, inhibiting reliable statistical inference. None of the environmental variables was found to be statistically significant.
The detailed results are included in supplementary materials (Table S1). Figure 3b shows pairwise Pearson correlation between the variables. The block with high correlation coefficients pertains to the 20 environmental variables, underlining high collinearity within this class of variables. A moderate correlation is also observed between Townsend deprivation index and a number of environmental variables. A larger figure with detailed labels is included in supplementary materials (Fig. S3).
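The preprocessing and component-selection rules described in this section (standardisation with clipping at five standard deviations, retaining enough components to explain 90% of the variance, and a penalty grid from 0 to 2e-3 in steps of 5e-5) could be sketched as follows. This is an illustration with NumPy under stated assumptions, not the code used in the study: the function names are hypothetical, and clipping symmetrically at ±5 is one reading of the text.

```python
import numpy as np


def standardise_and_clip(X, limit=5.0):
    """Z-score each column, then clip values to [-limit, limit].

    Assumes every column has non-zero variance.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.clip(Z, -limit, limit)


def n_components_for_variance(X, threshold=0.90):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, threshold) + 1)


# Penalty grid described in the text: 0 to 2e-3 in steps of 5e-5 (41 values).
lambda_cox_grid = np.linspace(0.0, 2e-3, 41)

# Toy data, for illustration only: one dominant direction and one minor one.
X_toy = np.array([[1.0, 0.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.1, 0.0],
                  [0.0, -0.1, 0.0]])
print(n_components_for_variance(X_toy, 0.90))
```

Selecting the number of components from the cumulative explained variance keeps the rule identical for PCA and SPCA, which is what makes the two Cox models comparable.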
Adding  Figure 4 demonstrated the shrinkage of the log(HR) estimates and the 95% CI for different values of λ_Cox. Figure 5 schematically compares the PCA and SPCA results. The environmental variables are shown on the far left. The widths of the links between the variables and the principal components are proportional to the loading coefficients. The links between the principal components and the outcome (i.e. all-cause mortality) are similarly proportional to the absolute value of the Cox coefficients (log(HR)). The associations that were found significant at α = 5% are highlighted in red. Detailed results are included in supplementary materials (Table S2).
In the PCA Cox model, the seventh component has a negative association with the outcome; however, given the complex interrelationship between the variables and the principal components, it is not possible to disentangle this association. In contrast, in the SPCA Cox model, the second component has a positive association with mortality, and this can be easily traced back to the three constituent variables of this component. Specifically, this component is the average of the three variables representing the average level of noise pollution in daytime, evening, and night-time. A one-unit change in this principal component corresponds to a 2.47 dB increase in the average daily noise pollution (details in supplementary materials), and this is associated with HR: 1.017 (95% CI: 1.004-1.030). Although the list of covariates that we adjusted for is much more comprehensive than in previous studies and included some traffic-related stressors that were correlated with noise pollution, our result (HR: 1.07, 95% CI: 1.02-1.13, per 10 dB increase in the average daily noise) is in agreement with previous studies 5, 7 .
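The conversion from the per-component hazard ratio to the per-10 dB hazard ratio can be reproduced with a short calculation. The sketch below assumes a log-linear exposure-response, as implied by a linear term in a Cox model; the function name `rescale_hr` is hypothetical, while the numbers are those reported above.

```python
import math


def rescale_hr(hr, from_units, to_units):
    """Re-express a hazard ratio per `from_units` of exposure as per `to_units`.

    Assumes the log hazard is linear in the exposure.
    """
    return math.exp(math.log(hr) * to_units / from_units)


# HR and 95% CI per one unit of the component (2.47 dB), from the text.
per_component = (1.017, 1.004, 1.030)
per_10_db = tuple(round(rescale_hr(v, 2.47, 10.0), 2) for v in per_component)
print(per_10_db)  # (1.07, 1.02, 1.13)
```

The rescaled values match the reported HR of 1.07 (95% CI: 1.02-1.13) per 10 dB.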
To further verify this association, we investigated whether it persists across different exposure levels. To this end, as suggested by the previous analysis, the three aforementioned noise pollution variables were averaged, forming a new variable that represents average daily noise pollution. This was then categorised into deciles, and the hazard ratios were calculated for the nine upper deciles relative to the lowest decile. The lowest decile represents noise pollution levels between 46.72 and 47.23 dB. To address the multicollinearity of the environmental covariates, the remaining 17 environmental variables were transformed to principal components explaining 92% of the variance (a proportion of 0.92). The model was adjusted for all other covariates. The results are depicted in Fig. 6, showing an upward trend which underlines the plausibility of a causal link. A descriptive summary of the subpopulations in each category and more details about the model are included in the supplementary materials (Table S3 and
interpretability of the derived representations, i.e. principal components, has been recognised as one of its major drawbacks. As shown in our results, in PCA, the entangled relationship between the principal components and the variables hinders the interpretation of findings. While some seek to mitigate this issue by deselecting non-important variables 32 or selecting variables more relevant to outcomes using supervised methods 33,34 , such interventions are not appropriate for statistical hypothesis testing, where all relevant covariates should be adjusted for regardless of their contribution to predictive performance. Over the years, other dimensionality reduction methods have been widely applied in different disciplines. Random Projection 35 , Dictionary Learning 36 , Factor Analysis, Independent Component Analysis 37 , and Non-negative Matrix Factorization (NMF) 38 are examples of these methods.
Recently, autoencoders, including the Denoising Autoencoder 39 and the Variational Autoencoder 40 , have been increasingly used to learn a low-dimensional representation of the input variables. However, as with PCA, the common limitation of these dimensionality reduction methods is the entangled relationship between the variables and the low-dimensional representations. Enforcing sparsity in the transformation is recognised as an effective way to address this problem 41 . Inspired by this, we investigated the use of SPCA for statistical hypothesis testing in the context of environmental health research and showed promising results. In light of the findings, we conclude that the integration of SPCA in statistical inference is a simple, computationally efficient strategy for big data investigations when multicollinearity could lead to erroneous results.
Previously, environmental epidemiology studies have adopted dimensionality reduction methods, as well as one-stop methods such as Bayesian profile regression 42 , to perform both dimensionality reduction and regression analysis for multiple pollutants. Some studies using PCA have identified a subset of air pollutants that were associated with mortality 34,43 . However, no studies have applied these statistical techniques to adjust for the wide range of environmental and non-environmental covariates that we considered in our analysis 44 . Neighbourhood-wide 45 and environment-wide 24 association studies (N/EWAS) have also been applied to high-dimensional data in environmental epidemiology. These methods are inspired by genome-wide association studies 46 , and their resource intensiveness, in terms of data and computational power, hinders their wider adoption.
Our analysis revealed a clear pattern of association between noise pollution and all-cause mortality. Notably, the three indicators of noise pollution (daytime, evening, and night-time noise levels) were combined into one principal component, all with the same weights. A one-unit increase in the resulting principal component (equivalent to a 2.47 dB increase in the average daily noise pollution) was associated with an HR of 1.017 (95% CI: 1.004-1.030) for all-cause mortality. This translates to an HR of 1.07 (95% CI: 1.02-1.13) per 10 dB increase in average noise level, in line with the only other study that showed a significant positive association between daily road traffic noise exposure and all-cause mortality (HR: 1.08, 95% CI: 1.04-1.12) 5 . A previous study in London reported an association between daytime noise and all-cause mortality in areas with noise pollution levels greater than 60 dB compared to areas with levels less than 55 dB (RR: 1.04, 95% CI: 1.00-1.07) 7 . While the hazard ratio calculated in our study is not directly comparable to those of the aforementioned studies, due to differences in the populations, study designs, data processing and covariates, the consistency of the findings is reassuring. Although the inclusion of correlated covariates can attenuate the significance of an association, our results are largely in agreement with these studies, suggesting that, independent of gaseous pollutants, traffic-related stressors and other determinants, noise level is an important risk factor. Nonetheless, the number of studies investigating the epidemiological link between road traffic noise exposure and all-cause mortality remains small, with a recent meta-analysis showing a weak association by pooling only five studies (HR: 1.01, 95% CI: 0.98-1.05) 47 .
A subsequent exposure-response analysis showed that the four highest exposure deciles are associated with a significantly higher risk of all-cause mortality compared to the lowest exposure decile.
Limitations and future work. The key strengths of our analysis are a large cohort and adjustment for a comprehensive list of environmental exposures, including correlated traffic-related exposures, which was facilitated by our methodological approach. This study has limitations. First, we did not account for any potential non-linear exposure-response relationship. The inclusion of non-linear and interaction terms could reduce the risk of residual confounding. However, our primary objective was to study the usability of SPCA as a simple, computationally efficient and interpretable method to address collinearity. Second, as already noted, exposure misclassification is inevitable in this type of study. Typically, if there were a true association with the health outcome, classical random error would bias the effect estimates toward the null. Third, the SPCA approach is essentially a data-driven method without a priori hypotheses about the variables and without consideration of the causal structures among the variables and/or the variable-outcome links. In our study, 20 environmental exposures from UK Biobank were reduced in dimensionality using SPCA and were all used in the Cox regression under the assumption of a causal structure linking each exposure to the outcome and the assumption that the exposures are independent of one another. However, in reality, some exposures may lie on a specific causal pathway (e.g. traffic intensity-air pollution-mortality). It is beyond the scope of the current study to investigate this complex causal structure, which requires careful consideration within a causal inference framework.
Taken together, these limitations mean that the findings generated from our SPCA analysis are mainly exploratory and imply neither a causal relationship nor biological plausibility.

Conclusion
This study demonstrated that SPCA is a viable analytical approach to address, and enable the interpretability of, associations between multiple environmental stressors and health. Using this method, our study further corroborated, in the UK Biobank, existing evidence that noise is an important risk factor for adverse health outcomes. The strength of our analysis was that this association was observed even after adjusting for a comprehensive list of correlated stressors.

Data availability
The data that support the findings of this study are available from the UK Biobank but restrictions apply to the availability of these data, which were used under license for the current study. The raw data are only available to approved researchers via the UK Biobank.